The search functionality is under construction.

Author Search Result

[Author] Shinji KIMURA(43hit)

21-40hit(43hit)

  • Optimizing Controlling-Value-Based Power Gating with Gate Count and Switching Activity

    Lei CHEN  Shinji KIMURA  

     
    PAPER-Logic Synthesis, Test and Verfication

      Vol:
    E92-A No:12
      Page(s):
    3111-3118

    In this paper, a new heuristic algorithm is proposed to optimize the power domain clustering in controlling-value-based (CV-based) power gating technology. In this algorithm, both the switching activity of sleep signals (p) and the overall numbers of sleep gates (gate count, N) are considered, and the sum of the product of p and N is optimized. The algorithm effectively exerts the total power reduction obtained from the CV-based power gating. Even when the maximum depth is kept to be the same, the proposed algorithm can still achieve power reduction approximately 10% more than that of the prior algorithms. Furthermore, detailed comparison between the proposed heuristic algorithm and other possible heuristic algorithms are also presented. HSPICE simulation results show that over 26% of total power reduction can be obtained by using the new heuristic algorithm. In addition, the effect of dynamic power reduction through the CV-based power gating method and the delay overhead caused by the switching of sleep transistors are also shown in this paper.

  • A Low-Power VLSI Architecture for HEVC De-Quantization and Inverse Transform

    Heming SUN  Dajiang ZHOU  Shuping ZHANG  Shinji KIMURA  

     
    PAPER

      Vol:
    E99-A No:12
      Page(s):
    2375-2387

    In this paper, we present a low-power system for the de-quantization and inverse transform of HEVC. Firstly, we present a low-delay circuit to process the coded results of the syntax elements, and then reduce the number of multipliers from 16 to 4 for the de-quantization process of each 4x4 block. Secondly, we give two efficient data mapping schemes for the memory between de-quantization and inverse transform, and the memory for transpose. Thirdly, the zero information is utilized through the whole system. For two memory parts, the write and read operation of zero blocks/ rows/ coefficients can all be skipped to save the power consumption. The results show that up to 86% power consumption can be saved for the memory part under the configuration of “Random-access” and common QPs. For the logical part, the proposed architecture for de-quantization can reduce 77% area consumption. Overall, our system can support real-time coding for 8K x 4K 120fps video sequences and the normalized area consumption can be reduced by 68% compared with the latest work.

  • Low-Power Motion Estimation Processor with 3D Stacked Memory

    Shuping ZHANG  Jinjia ZHOU  Dajiang ZHOU  Shinji KIMURA  Satoshi GOTO  

     
    PAPER

      Vol:
    E98-A No:7
      Page(s):
    1431-1441

    Motion estimation (ME) is a key encoding component of almost all modern video coding standards. ME contributes significantly to video coding efficiency, but, it also consumes the most power of any component in a video encoder. In this paper, an ME processor with 3D stacked memory architecture is proposed to reduce memory and core power consumption. First, a memory die is designed and stacked with ME die. By adding face-to-face (F2F) pads and through-silicon-via (TSV) definitions, 2D electronic design automation (EDA) tools can be extended to support the proposed 3D stacking architecture. Moreover, a special memory controller is applied to control data transmission and timing between the memory die and the ME processor die. Finally, a 3D physical design is completed for the entire system. This design includes TSV/F2F placement, floor plan optimization, and power network generation. Compared to 2D technology, the number of input/output (IO) pins is reduced by 77%. After optimizing the floor plan of the processor die and memory die, the routing wire lengths are reduced by 13.4% and 50%, respectively. The stacking static random access memory contributes the most power reduction in this work. The simulation results show that the design can support real-time 720p @ 60fps encoding at 8MHz using less than 65mW in power, which is much better compared to the state-of-the-art ME processor.

  • ECC-Based Bit-Write Reduction Code Generation for Non-Volatile Memory

    Masashi TAWADA  Shinji KIMURA  Masao YANAGISAWA  Nozomu TOGAWA  

     
    PAPER-High-Level Synthesis and System-Level Design

      Vol:
    E98-A No:12
      Page(s):
    2494-2504

    Non-volatile memory has many advantages such as high density and low leakage power but it consumes larger writing energy than SRAM. It is quite necessary to reduce writing energy in non-volatile memory design. In this paper, we propose write-reduction codes based on error correcting codes and reduce writing energy in non-volatile memory by decreasing the number of writing bits. When a data is written into a memory cell, we do not write it directly but encode it into a codeword. In our write-reduction codes, every data corresponds to an information vector in an error-correcting code and an information vector corresponds not to a single codeword but a set of write-reduction codewords. Given a writing data and current memory bits, we can deterministically select a particular write-reduction codeword corresponding to the data to be written, where the maximum number of flipped bits are theoretically minimized. Then the number of writing bits into memory cells will also be minimized. Experimental results demonstrate that we have achieved writing-bits reduction by an average of 51% and energy reduction by an average of 33% compared to non-encoded memory.

  • Bit-Length Optimization Method for High-Level Synthesis Based on Non-linear Programming Technique

    Nobuhiro DOI  Takashi HORIYAMA  Masaki NAKANISHI  Shinji KIMURA  

     
    PAPER-System Level Design

      Vol:
    E89-A No:12
      Page(s):
    3427-3434

    High-level synthesis is a novel method to generate a RT-level hardware description automatically from a high-level language such as C, and is used at recent digital circuit design. Floating-point to fixed-point conversion with bit-length optimization is one of the key issues for the area and speed optimization in high-level synthesis. However, the conversion task is a rather tedious work for designers. This paper introduces automatic bit-length optimization method on floating-point to fixed-point conversion for high-level synthesis. The method estimates computational errors statistically, and formalizes an optimization problem as a non-linear problem. The application of NLP technique improves the balancing between computational accuracy and total hardware cost. Various constraints such as unit sharing, maximum bit-length of function units can be modeled easily, too. Experimental result shows that our method is fast compared with typical one, and reduces the hardware area.

  • Distortion Control and Optimization for Lossy Embedded Compression in Video Codec System

    Li GUO  Dajiang ZHOU  Shinji KIMURA  Satoshi GOTO  

     
    PAPER-Coding Theory

      Vol:
    E100-A No:11
      Page(s):
    2416-2424

    For mobile video codecs, the huge energy dissipation for external memory traffic is a critical challenge under the battery power constraint. Lossy embedded compression (EC), as a solution to this challenge, is considered in this paper. While previous studies in lossy EC mostly focused on algorithm optimization to reduce distortion, this work, to the best of our knowledge, is the first one that addresses the distortion control. Firstly, from both theoretical analysis and experiments for distortion optimization, a conclusion is drawn that, at the frame level, allocating memory traffic evenly is a reliable approximation to the optimal solution to minimize quality loss. Then, to reduce the complexity of decoding twice, the distortion between two sequences is estimated by a linear function of that calculated within one sequence. Finally, on the basis of even allocation, the distortion control is proposed to determine the amount of memory traffic according to a given distortion limitation. With the adaptive target setting and estimating function updating in each group of pictures (GOP), the scene change in video stream is supported without adding a detector or retraining process. From experimental results, the proposed distortion control is able to accurately fix the quality loss to the target. Compared to the baseline of negative feedback on non-referred B frames, it achieves about twice memory traffic reduction.

  • Efficient Hybrid Grid Synthesis Method Based on Genetic Algorithm for Power/Ground Network Optimization with Dynamic Signal Consideration

    Yun YANG  Shinji KIMURA  

     
    PAPER-Physical Level Design

      Vol:
    E91-A No:12
      Page(s):
    3431-3442

    This paper proposes an efficient design algorithm for power/ground (P/G) network synthesis with dynamic signal consideration, which is mainly caused by Ldi/dt noise and Cdv/dt decoupling capacitance (DECAP) current in the distribution network. To deal with the nonlinear global optimization under synthesis constraints directly, the genetic algorithm (GA) is introduced. The proposed GA-based synthesis method can avoid the linear transformation loss and the restraint condition complexity in current SLP, SQP, ICG, and random-walk methods. In the proposed Hybrid Grid Synthesis algorithm, the dynamic signal is simulated in the gene disturbance process, and Trapezoidal Modified Euler (TME) method is introduced to realize the precise dynamic time step process. We also use a hybrid-SLP method to reduce the genetic execute time and increase the network synthesis efficiency. Experimental results on given power distribution network show the reduction on layout area and execution time compared with current P/G network synthesis methods.

  • On Gate Level Power Optimization of Combinational Circuits Using Pseudo Power Gating

    Yu JIN  Shinji KIMURA  

     
    PAPER-Physical Level Design

      Vol:
    E95-A No:12
      Page(s):
    2191-2198

    In recent years, the demand for low-power design has remained undiminished. In this paper, a pseudo power gating (SPG) structure using a normal logic cell is proposed to extend the power gating to an ultrafine grained region at the gate level. In the proposed method, the controlling value of a logic element is used to control the switching activity of modules computing other inputs of the element. For each element, there exists a submodule controlled by an input to the element. Power reduction is maximized by controlling the order of the submodule selection. A basic algorithm and a switching activity first algorithm have been developed to optimize the power. In this application, a steady maximum depth constraint is added to prevent the depth increase caused by the insertion of the control signal. In this work, various factors affecting the power consumption of library level circuits with the SPG are determined. In such factors, the occurrence of glitches increases the power consumption and a method to reduce the occurrence of glitches is proposed by considering the parity of inverters. The proposed SPG method was evaluated through the simulation of the netlist extracted from the layout using the VDEC Rohm 0.18 µm process. Experiments on ISCAS'85 benchmarks show that the reduction in total power consumption achieved is 13% on average with a 2.5% circuit delay degradation. Finally, the effectiveness of the proposed method under different primary input statistics is considered.

  • Write Control Method for Nonvolatile Flip-Flops Based on State Transition Analysis

    Naoya OKADA  Yuichi NAKAMURA  Shinji KIMURA  

     
    PAPER

      Vol:
    E96-A No:6
      Page(s):
    1264-1272

    Nonvolatile flip-flop enables leakage power reduction in logic circuits and quick return from standby mode. However, it has limited write endurance, and its power consumption for writing is larger than that of conventional D flip-flop (DFF). For this reason, it is important to reduce the number of write operations. The write operations can be reduced by stopping the clock signal to synchronous flip-flops because write operations are executed only when the clock is applied to the flip-flops. In such clock gating, a method using Exclusive OR (XOR) of the current value and the new value as the control signal is well known. The XOR based method is effective, but there are several cases where the write operations can be reduced even if the current value and the new value are different. The paper proposes a method to detect such unnecessary write operations based on state transition analysis, and proposes a write control method to save power consumption of nonvolatile flip-flops. In the method, redundant bits are detected to reduce the number of write operations. If the next state and the outputs do not depend on some current bit, the bit is redundant and not necessary to write. The method is based on Binary Decision Diagram (BDD) calculation. We construct write control circuits to stop the clock signal by converting BDDs representing a set of states where write operations are unnecessary. Proposed method can be combined with the XOR based method and reduce the total write operations. We apply combined method to some benchmark circuits and estimate the power consumption with Synopsys NanoSim. On average, 15.0% power consumption can be reduced compared with only the XOR based method.

  • Issue Mechanism for Embedded Simultaneous Multithreading Processor

    Chengjie ZANG  Shigeki IMAI  Steven FRANK  Shinji KIMURA  

     
    PAPER

      Vol:
    E91-A No:4
      Page(s):
    1092-1100

    Simultaneous Multithreading (SMT) technology enhances instruction throughput by issuing multiple instructions from multiple threads within one clock cycle. For in-order pipeline to each thread, SMT processors can provide large number of issued instructions close to or surpass than using out-of-order pipeline. In this work, we show an efficient issue logic for predicated instruction sequence with the parallel flag in each instruction, where the predicate register based issue control is adopted and the continuous instructions with the parallel flag of '0' are executed in parallel. The flag is pre-defined by a compiler. Instructions from different threads are issued based on the round-robin order. We also introduce an Instruction Queue skip mechanism for thread if the queue is empty. Using this kind of issue logic, we designed a 6 threads, 7-stage, in-order pipeline processor. Based on this processor, we compare round-robin issue policy (RR(T1-Tn)) with other policies: thread one always has the highest priority (PR(T1)) and thread one or thread n has the highest priority in turn (PR(T1-Tn)). The results show that RR(T1-Tn) policy outperforms others and PR(T1-Tn) is almost the same to RR(T1-Tn) from the point of view of the issued instructions per cycle.

  • The Optimal Architecture Design of Two-Dimension Matrix Multiplication Jumping Systolic Array

    Yun YANG  Shinji KIMURA  

     
    PAPER

      Vol:
    E91-A No:4
      Page(s):
    1101-1111

    This paper proposes an efficient systolic array construction method for optimal planar systolic design of the matrix multiplication. By connection network adjustment among systolic array processing element (PE), the input/output data are jumping in the systolic array for multiplication operation requirements. Various 2-D systolic array topologies, such as square topology and hexagonal topology, have been studied to construct appropriate systolic array configuration and realize high performance matrix multiplication. Based on traditional Kung-Leiserson systolic architecture, the proposed "Jumping Systolic Array (JSA)" algorithm can increase the matrix multiplication speed with less processing elements and few data registers attachment. New systolic arrays, such as square jumping array, redundant dummy latency jumping hexagonal array, and compact parallel flow jumping hexagonal array, are also proposed to improve the concurrent system operation efficiency. Experimental results prove that the JSA algorithm can realize fully concurrent operation and dominate other systolic architectures in the specific systolic array system characteristics, such as band width, matrix complexity, or expansion capability.

  • Parallel Binary Decision Diagram Manipulation

    Shinji KIMURA  Tsutomu IGAKI  Hiromasa HANEDA  

     
    PAPER

      Vol:
    E75-A No:10
      Page(s):
    1255-1262

    The paper describes a parallel algorithm for the manipulation of binary decision diagrams on a shared memory multi-processor system. Binary decision diagrams are very efficient representations of logic functions, and are widely used in computer aided design of logic circuits. Logic operations on logic functions such as AND and OR are reduced to operations on binary decision diagrams representing these functions. Operations on binary decision diagrams are time-consuming in some cases, and a fast manipulation method is needed. As with the manipulation, we focus on the construction of a binary decision diagram from a logic formula, and devised a parallel algorithm for the construction. In the construction, there are many logic operations to be processed, and some of them can be processed in parallel. At first, we introduce an extraction method and a parallel-execution method for such parallelizable operations. This is the parallel execution method for an operation sequence (or a set of operations). To extract more parallelism, we introduce a dynamic expansion method of a logic operation. The dynamic expansion is a method to obtain sub-operations from a logic operation using the modified Shannon's expansion. These sub-operations are executed in parallel and the results of these sub-operations are merged to obtain the result of the original operation. Our parallel algorithm, which is based on the construction of shared binary decision diagrams with the negative edge and the operation cache, is implemented in C on a shared memory multi-processor system Sequent S-81 (CPU 80386 (16 MHz)28, 86.75MB), and applied to multiplier examples and ISCAS benchmarks. The speed-up ratio becomes 14 for multipliers, and becomes 11 for c1908 in ISCAS benchmarks.

  • Multi-Operand Adder Synthesis Targeting FPGAs

    Taeko MATSUNAGA  Shinji KIMURA  Yusuke MATSUNAGA  

     
    PAPER-Logic Synthesis, Test and Verification

      Vol:
    E94-A No:12
      Page(s):
    2579-2586

    Multi-operand adders, which calculates the summation of more than two operands, usually consist of compressor trees which reduce the number of operands to two without any carry propagation, and a carry-propagate adder for the two operands in ASIC implementation. The former part is usually realized using full adders or (3;2) counters like Wallace-trees in ASIC, while adder trees or dedicated hardware are used in FPGA. In this paper, an approach to realize compression trees on FPGAs is proposed. In case of FPGA with m-input LUT, any counters with up to m inputs can be realized with one LUT per an output. Our approach utilizes generalized parallel counters (GPCs) with up to m inputs and synthesizes high-performance compressor trees by setting some intermediate height limits in the compression process like Dadda's multipliers. Experimental results show that the number of GPCs are reduced by up to 22% compared to the existing heuristic. Its effectivity on reduction of delay is also shown against existing approaches on Altera's Stratix III.

  • Timing Verification of Sequential Logic Circuits Based on Controlled Multi-Clock Path Analysis

    Kazuhiro NAKAMURA  Shinji KIMURA  Kazuyoshi TAKAGI  Katsumasa WATANABE  

     
    PAPER-Timing Verification and Optimization

      Vol:
    E81-A No:12
      Page(s):
    2515-2520

    This paper introduces a new kind of false path, which is sensitizable but does not affect the decision of the maximum clock frequency. Such false paths exist in multi-clock operations controlled by waiting states, and the delay time of these paths can be greater than the clock period. This paper proposes a method to detect these waiting false paths based on the symbolic state traversal. In this method, the maximum allowable clock cycle of each path is computed using update cycles of each register.

  • FOREWORD

    Shinji KIMURA  

     
    FOREWORD

      Vol:
    E88-A No:12
      Page(s):
    3273-3273
  • Prciseness of Discrete Time Verification

    Shinji KIMURA  Shunsuke TSUBOTA  Hiromasa HANEDA  

     
    PAPER

      Vol:
    E76-A No:10
      Page(s):
    1755-1759

    The discrete time analysis of logic circuits is usually more efficient than the continuous time analysis, but the preciseness of the discrete time analysis is not guaranteed. The paper shows a method to decide a unit time for a logic circuit under which the analysis result is the same as the result based on the continuous time. The delay time of an element is specified with an interval between the minimum and maximum delay times, and we assume an analysis method which enumerates all possible delay cases under the deisrete time. Our main theorem is as follows: refine the unit time by a factor of 1/2, and if the analysis result with a unit time u and that with a unit time u/2 are the same, then u is the expected unit time.

  • Design of Low-Cost Approximate Multipliers Based on Probability-Driven Inexact Compressors

    Yi GUO  Heming SUN  Ping LEI  Shinji KIMURA  

     
    PAPER

      Vol:
    E102-A No:12
      Page(s):
    1781-1791

    Approximate computing has emerged as a promising approach for error-tolerant applications to improve hardware performance at the cost of some loss of accuracy. Multiplication is a key arithmetic operation in these applications. In this paper, we propose a low-cost approximate multiplier design by employing new probability-driven inexact compressors. This compressor design is introduced to reduce the height of partial product matrix into two rows, based on the probability distribution of the sum result of partial products. To compensate the accuracy loss of the multiplier, a grouped error recovery scheme is proposed and achieves different levels of accuracy. In terms of mean relative error distance (MRED), the accuracy losses of the proposed multipliers are from 1.07% to 7.86%. Compared with the Wallace multiplier using 40nm process, the most accurate variant of the proposed multipliers can reduce power by 59.75% and area by 42.47%. The critical path delay reduction is larger than 12.78%. The proposed multiplier design has a better accuracy-performance trade-off than other designs with comparable accuracy. In addition, the efficiency of the proposed multiplier design is assessed in an image processing application.

  • Theory and Application of Topology-Based Exact Synthesis for Majority-Inverter Graphs

    Xianliang GE  Shinji KIMURA  

     
    PAPER-VLSI Design Technology and CAD

      Pubricized:
    2023/03/03
      Vol:
    E106-A No:9
      Page(s):
    1241-1250

    Majority operation has been paid attention as a basic element of beyond-Moore devices on which logic functions are constructed from Majority elements and inverters. Several optimization methods are developed to reduce the number of elements on Majority-Inverter Graphs (MIGs) but more area and power reduction are required. The paper proposes a new exact synthesis method for MIG based on a new topological constraint using node levels. Possible graph structures are clustered by the levels of input nodes, and all possible structures can be enumerated efficiently in the exact synthesis compared with previous methods. Experimental results show that our method decreases the runtime up to 25.33% compared with the fence-based method, and up to 6.95% with the partial-DAG-based method. Furthermore, our implementation can achieve better performance in size optimization for benchmark suites.

  • Input Data Format for Sparse Matrix in Quantum Annealing Emulator

    Sohei SHIMOMAI  Kei UEDA  Shinji KIMURA  

     
    PAPER-Algorithms and Data Structures

      Pubricized:
    2023/09/25
      Vol:
    E107-A No:3
      Page(s):
    557-565

    Recently, Quantum Annealing (QA) has attracted attention as an efficient algorithm for combinatorial optimization problems. In QA, the input data size becomes large and its reduction is important for accelerating by the hardware emulation since the usable memory size and its bandwidth are limited. The paper proposes the compression method of input sparse matrices for QA emulator. The proposed method uses the sparseness of the coefficient matrix and the reappearance of the same values. An independent table is introduced and data are compressed by the search and registration method of two consecutive data in the value table. The proposed method is applied to Traveling Salesman Problem (TSP) with 32, 64 and 96 cities and Nurse Scheduling Problem (NSP). The proposed method could reduce the amount of data by 1/40 for 96 city TSP and could manage 96 city TSP on the hardware emulator. When applied to NSP, we confirmed the effectiveness of the proposed method by the compression ratio ranging from 1/4 to 1/11.8. The data reduction is also useful for the simulation/emulation performance when using the compressed data directly and 1.9 times faster speed can be found on 96 city TSP than the CSR-based method.

  • Fast SAO Estimation Algorithm and Its Implementation for 8K×4K @ 120 FPS HEVC Encoding

    Jiayi ZHU  Dajiang ZHOU  Shinji KIMURA  Satoshi GOTO  

     
    PAPER-High-Level Synthesis and System-Level Design

      Vol:
    E97-A No:12
      Page(s):
    2488-2497

    High efficiency video coding (HEVC) is the new generation video compression standard. Sample adaptive offset (SAO) is a new compression tool adopted in HEVC which reduces the distortion between original samples and reconstructed samples. SAO estimation is the process of determining SAO parameters in video encoding. It is divided into two phases: statistic collection and parameters determination. There are two difficulties for VLSI implementation of SAO estimation. The first is that there are huge amount of samples to deal with in statistic collection phase. The other is that the complexity of Rate Distortion Optimization (RDO) in parameters determination phase is very high. In this article, a fast SAO estimation algorithm and its corresponding VLSI architecture are proposed. For the first difficulty, we use bitmaps to collect statistics of all the 16 samples in one 4×4 block simultaneously. For the second difficulty, we simplify a series of complicated procedures in HM to balance the algorithms complexity and BD-rate performance. Experimental results show that the proposed algorithm maintains the picture quality improvement. The VLSI design based on this algorithm can be implemented using 156.32K gates, 8,832bits single port RAM for 8bits depth case. It can be synthesized to 400MHz @ 65nm technology and is capable of 8K×4K @ 120fps encoding.

21-40hit(43hit)